- Figure and Table numbering? –> rather not.
- Figure caption in first column, for ICE plots?
- Or figure caption in separate column with font facing upwards (rotated by 90°)?
- Features: numeric, not categorical (!)
Oct 2018
This work investigates the influence of weather on bike rentals. …[[todo]]
Two datasets were used: Bike sharing data…
…and weather data from the Canadian government:
DataFrame each for individual bike rides and hourly weather data. The research questions that I wanted to answer with my analysis were:
First, some data exploration was performed, in order to get to know the data and to find out how the number of hourly bike trips is distributed across the investigated time span. Also, the interrelation of features was investigated by means of a correlation heatmap.
To find out how well the number of bike rides can be predicted from the data, different models were used. As a baseline model, a moving average was calculated to see how well this very simple model can explain the data.
Then, after splitting the data into a \(90\%\) training set and a \(10\%\) test set, different machine learning models were fitted to predict the hourly number of bike rides from the available data: random forest regression, and gradient boosting regression via scikit-learn and XGBoost. The hyperparameters of the most promising model, scikit-learn’s gradient boosting regression, were tuned via randomized search with \(4\)-fold cross-validation. Variable importance was used to identify the most important influence factors, and Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots [1] were used to visualize the influences of the important variables on the number of bike trips.
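The tuning procedure described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code used for the analysis: the feature matrix, the search space in `param_distributions`, and `n_iter` are assumptions, since the ranges actually searched are not stated here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the hourly feature matrix and ride counts (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=400)

# 90% training set, 10% test set, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Hypothetical search space; the real ranges are not given in the text.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled configurations (assumption)
    cv=4,            # 4-fold cross-validation, training set only
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
# The best configuration is refit on the full training set by default,
# so it can be scored directly on the held-out test set.
test_r2 = search.score(X_test, y_test)
```

Keeping the test set out of the search means that `test_r2` is not biased by the hyperparameter selection.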
Choosing the closest weather station:
The Canadian government’s past weather and climate service offers a search by proximity function. Via this service, some sample data of the closest stations to Montreal were downloaded. Each of the data files contains information about the weather stations, including the geographical position (latitude and longitude). These coordinates were plotted on a map (see Figure on the right), and the closest station to the bulk of the data was chosen (station name: MCTAVISH). Only data from this station was used.
Figure: Plot shows all starting stations of a bike trip (red dots), as well as the closest weather stations (blue markers). The closest station in the center is the MCTAVISH weather station (Climate Identifier 7024745, WMO Identifier 71612).
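The station was chosen visually from the map. As a hedged alternative, the same choice could be made numerically by computing the great-circle (haversine) distance from the centroid of all bike stations to each candidate weather station. The station names and coordinates below are made up for illustration and are not the real station positions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_station(bike_stations, weather_stations):
    """Name of the weather station nearest to the centroid of the bike stations."""
    lat_c = sum(lat for lat, _ in bike_stations) / len(bike_stations)
    lon_c = sum(lon for _, lon in bike_stations) / len(bike_stations)
    return min(
        weather_stations,
        key=lambda name: haversine_km(lat_c, lon_c, *weather_stations[name]),
    )


# Hypothetical coordinates (latitude, longitude), for illustration only.
bike_stations = [(45.50, -73.57), (45.52, -73.58), (45.51, -73.56)]
weather_stations = {
    "STATION_A": (45.51, -73.58),
    "STATION_B": (45.47, -73.74),
    "STATION_C": (45.68, -73.30),
}

nearest = closest_station(bike_stations, weather_stations)
```

Using the centroid is a simplification: weighting each bike station by its number of trips would correspond more closely to "the bulk of the data".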
To get a better understanding of the data, the number of hourly bike trips was visualized for the time span between \(2014\) and \(2017\).
The moving average that is shown in the plot (red line) can be interpreted as a baseline model, i.e., the simplest possible model to describe the hourly number of bike rides.
This baseline model explains \(38.8\%\) of the variance \((r^2 = 0.388)\) and has a mean absolute error of \(MAE = 316.2\), which means that, on average, the “prediction” for the number of hourly bike rides is off by this many rides. These figures also include the winter months, in which there are no rides. For a more realistic estimate of model quality, only the time frame from May to September was considered: here, \(r^2\) drops to \(0.079\) and \(MAE\) rises to \(510.7\).
Figure: Number of hourly rides from \(2014\) to \(2017\). Each dot represents the number of trips in one specific hour. Red line represents a moving average using a window of \(14\) days.
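A moving-average baseline of this kind, together with its \(r^2\) and \(MAE\), could be computed along the following lines. This is a sketch on synthetic ride counts with made-up numbers. Whether the original window was centered or trailing is not stated, so the centered window here is an assumption.

```python
import numpy as np
import pandas as pd


def r2_and_mae(y_true, y_pred):
    """Return (r^2, MAE) for predictions against observed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot), float(np.mean(np.abs(y_true - y_pred)))


# Synthetic hourly ride counts for one (leap) year: no rides in winter,
# a daily cycle during the season. All numbers are made up.
idx = pd.date_range("2016-01-01", periods=366 * 24, freq=pd.Timedelta(hours=1))
hour = idx.hour.to_numpy()
day_of_year = idx.dayofyear.to_numpy()
in_season = (day_of_year >= 105) & (day_of_year <= 320)
rides = pd.Series(
    np.where(in_season, 300 + 250 * np.sin(2 * np.pi * (hour - 6) / 24), 0.0),
    index=idx,
)

# 14-day moving average over hourly data (centered window: an assumption).
window = 14 * 24
baseline = rides.rolling(window=window, center=True, min_periods=1).mean()

# Metrics over the whole year, including months without rides.
r2_all, mae_all = r2_and_mae(rides, baseline)

# Metrics restricted to May through September.
month = idx.month.to_numpy()
summer = (month >= 5) & (month <= 9)
r2_summer, mae_summer = r2_and_mae(rides[summer], baseline[summer])
```

On this toy series the pattern matches the one reported above: the moving average captures the seasonal on/off structure but none of the within-day variation, so \(r^2\) is lower and \(MAE\) is higher when only the summer months are scored.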
To visualize the relations between the available features, a correlation heatmap is shown on the right. The features are only slightly correlated; the only exception is temperature and dew point, which show an almost perfect (linear) relationship \((r = 0.93)\).
To avoid problems resulting from this multicollinearity, only temperature was used as a predictor, and dew point was dropped. Although gradient boosting is less affected by multicollinearity, it might still influence calculations of variable importance [2,3].
Figure: Pearson Correlations between available features in the data.
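The check behind the heatmap can be illustrated as follows: compute the Pearson correlation matrix, list feature pairs above a threshold, and drop one feature of each such pair. The column names, the synthetic data, and the threshold of \(0.9\) are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Synthetic weather features; dew point is constructed to track temperature.
rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(15, 8, n)
weather = pd.DataFrame(
    {
        "temperature": temperature,
        "dew_point": temperature - 5 + rng.normal(0, 3, n),
        "pressure": rng.normal(101, 1, n),
        "wind_speed": rng.gamma(2.0, 5.0, n),
    }
)

# Pearson correlation matrix, i.e. the numbers shown in a correlation heatmap.
corr = weather.corr(method="pearson")


def highly_correlated_pairs(corr, threshold=0.9):
    """List (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


pairs = highly_correlated_pairs(corr)

# Keep temperature, drop dew point, as done in the analysis.
features = weather.drop(columns=["dew_point"])
```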
Table: Model performance measures for different models. \(r^2\) is the amount of variance explained, \(MAE\) stands for mean absolute error. \(train\) and \(test\) specify training and testing set, respectively. For the test set, some performance measures were also re-computed using only the summer months of the test set (May to September), indicated via the \(summer\) subscripts.
| Model | \(r^2_{train}\) | \(r^2_{test}\) | \(r^2_{summer}\) | \(MAE_{test}\) | \(MAE_{summer}\) |
|---|---|---|---|---|---|
| Gradient Boosting (XGBoost) | \(0.889\) | \(0.860\) | NA | \(158.0\) | NA |
| Random Forest | \(0.913\) | \(0.894\) | NA | \(111.2\) | NA |
| Gradient Boosting (sklearn) | \(0.997\) | \(0.941\) | \(0.933\) | \(85.4\) | \(105.4\) |
The explained variance on the test set ranged from \(r_{test}^2 = 0.860\) (XGBoost) to \(r_{test}^2 = 0.941\) for the final model (see table). The hyperparameters for the final model (gradient boosting via scikit-learn) were selected via randomized search using \(4\)-fold cross-validation (using only the training set).
The final model fits the data very well, explaining \(94.1\%\) of the variance and exhibiting an average error of \(85.4\) rides over all hourly predictions. When only summer months are considered, this error increases only moderately, to \(105.4\) rides.
Figure: The figure shows actual (true) values vs. predicted values for the number of hourly bike trips for the final model. A perfect model would yield predictions that are identical to the true values, i.e., all points would be on the \(45°\) diagonal. This model is relatively close.
The most important features that influenced the prediction of hourly bike rides were temperature and atmospheric pressure. While the former is easily comprehensible, the latter is best understood as a proxy for precipitation, which was not available from the weather data: low pressure is commonly related to rainy weather [4]. While relative humidity is not as tightly connected to rain [5], it interacts with temperature, e.g., it influences how (high) temperatures are perceived [6].
Further important predictors are the hour of the day and the day of the week, as well as wind direction and speed. How these features influence the predicted number of bike trips is best detailed by specific plots, so-called Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots [1].
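To make the idea behind these plots concrete, the following sketch computes ICE curves by hand: for every observation, one feature is swept over a grid of values while all other features stay fixed, and the model prediction is recorded. The PDP is the average of the ICE curves [1]. The data and model here are synthetic stand-ins, not the fitted model from this analysis.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def ice_curves(model, X, feature, grid):
    """One row per observation, one column per grid value of `feature`."""
    X = np.asarray(X, dtype=float)
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for all rows
        curves[:, j] = model.predict(X_mod)
    return curves


# Synthetic data: feature 0 has a strong linear effect, feature 1 a weak
# quadratic effect, feature 2 is pure noise (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

grid = np.linspace(-1, 1, 11)
ice = ice_curves(model, X, feature=0, grid=grid)  # individual curves
pdp = ice.mean(axis=0)                            # partial dependence

# Impurity-based variable importance of the fitted model (sums to 1).
importances = model.feature_importances_
```

Plotting each row of `ice` as a thin line and `pdp` as a bold line gives the usual ICE/PDP display. Recent scikit-learn versions also offer this via `sklearn.inspection.PartialDependenceDisplay.from_estimator` with `kind="both"`.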
Figure:
Individual Conditional Expectation (ICE) plots [1] to quantify the main effects of some of the predictors.
For details, see figure caption.
Figure: Figure Caption … Figure Caption … Figure Caption … Figure Caption … Figure Caption … Figure Caption … Figure Caption … Figure Caption … Figure Caption … Figure Caption … Figure Caption …
1. Goldstein A, Kapelner A, Bleich J, Pitkin E. Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation [Internet]. 2014. Available: https://arxiv.org/pdf/1309.6392.pdf
2. Wheatley J. Random forest or gradient boosting? [Internet]. 2014. Available: http://joewheatley.net/random-forest-or-gradient-boosting/
3. Parr T, Turgutlu K, Csiszar C, Howard J. Beware Default Random Forest Importances [Internet]. 2018. Available: http://explained.ai/rf-importance/index.html
4. Morgan L. Why Does it Rain When the Pressure Is Low? [Internet]. 2017. Available: https://sciencing.com/rain-pressure-low-8738476.html
5. Haby J. RELATIVE HUMIDITY PITFALLS [Internet]. Available: http://www.theweatherprediction.com/habyhints2/564/
6. Dotson JD. How Temperature & Humidity are Related [Internet]. 2018. Available: https://sciencing.com/temperature-ampamp-humidity-related-7245642.html